Treatment discovery based on CGH analysis

ABSTRACT

Methods, systems and computer readable media for discovering a combination of treatments to reduce the progress of, or eliminate a tissue malady. Gene expression values of at least one sample of tissue exhibiting the tissue malady and at least one reference sample tissue that does not exhibit the malady are measured using at least one CGH array designed to measure gene sequences and possible variations in gene sequences attributable to the malady. Gene expression signatures are generated from differential expression values of ratios of the measured gene expression values between the at least one sample exhibiting the malady and the at least one reference sample, across all samples, respectively. The tissue samples exhibiting the malady are treated with a treatment, and a treatment-response value is measured with respect to each of the tissue samples treated, as effected by the treatment. A phenotypic signature representing the treatment-response values of each of the tissue samples treated is generated for characterizing the effects of the treatment on the tissues treated. Processing may be repeated with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments. A clustering operation is then based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together, and treatments are selected by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the at least one tissue sample exhibiting the malady.

CROSS-REFERENCE

This application is a continuation-in-part application of application Ser. No. 10/640,081, filed Aug. 13, 2003, which is incorporated herein by reference in its entirety and to which application we claim priority under 35 USC §120.

BACKGROUND OF THE INVENTION

Many genomic and genetic studies are directed to the identification of differences in gene dosage or expression among cell populations for the study and detection of disease. For example, many malignancies involve the gain or loss of DNA sequences (alterations in copy number), sometimes entire chromosomes, that may result in activation of oncogenes or inactivation of tumor suppressor genes. Identification of the genetic events leading to neoplastic transformation and subsequent progression can facilitate efforts to define the biological basis for disease, improve prognostication of therapeutic response, and permit earlier tumor detection. In addition, perinatal genetic problems frequently result from loss or gain of chromosome segments, as in trisomy 21 or the microdeletion syndromes. Trisomy of chromosome 13 results in Patau syndrome. Abnormal numbers of sex chromosomes result in various developmental disorders. Thus, methods of prenatal detection of such abnormalities can be helpful in early diagnosis of disease.

Comparative genomic hybridization (CGH) is a technique that is used to evaluate variations in genomic copy number in cells. In one implementation of CGH, genomic DNA is isolated from normal reference cells, as well as from test cells (e.g., tumor cells). The two nucleic acids are differentially labeled and then simultaneously hybridized in situ to metaphase chromosomes of a reference cell. Chromosomal regions in the test cells which are at increased or decreased copy number can be identified by detecting regions where the ratio of signal from the two distinguishably labeled nucleic acids is altered. For example, those regions that have been decreased in copy number in the test cells will show relatively lower signal from the test DNA than from the reference DNA, compared to other regions of the genome. Regions that have been increased in copy number in the test cells will show relatively higher signal from the test DNA.
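The ratio-based detection described above can be illustrated with a minimal sketch. The log2 transform, the symmetric cutoff of 0.3, the function names and the signal values are all assumptions made for illustration; none of them is taken from this disclosure.

```python
import math

def log2_ratios(test_signal, reference_signal):
    """Per-region log2(test / reference) from two paired signal lists."""
    return [math.log2(t / r) for t, r in zip(test_signal, reference_signal)]

def call_copy_number(ratios, threshold=0.3):
    """Label each region 'gain', 'loss' or 'normal' using a symmetric cutoff.

    The threshold is a hypothetical value chosen only for this example.
    """
    calls = []
    for value in ratios:
        if value >= threshold:
            calls.append("gain")
        elif value <= -threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

# Hypothetical signal intensities for four chromosomal regions.
test = [1000.0, 2100.0, 480.0, 990.0]
reference = [1000.0, 1000.0, 1000.0, 1000.0]
calls = call_copy_number(log2_ratios(test, reference))
print(calls)
```

In this sketch a region whose test signal is roughly double the reference signal is labeled a gain, and one at roughly half is labeled a loss; a practical analysis would typically also normalize the signals and account for noise before making such calls.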

A recent technology development introduced an oligonucleotide array platform for array based comparative genomic hybridization (aCGH) analyses. Such approaches offer benefits over immobilized chromosome approaches, including a higher resolution, as defined by the ability of the assay to localize chromosomal alterations to specific areas of the genome. For further detailed description regarding aCGH technology, the reader is referred to co-pending application Ser. No. 10/744,595 filed Dec. 22, 2003 and titled “Comparative Genomic Hybridization Assays Using Immobilized Oligonucleotide Features and Compositions for Practicing the Same”, which is incorporated herein, in its entirety, by reference thereto.

There is a continuing need for techniques and systems for using array CGH technology not only for analyzing disease states and associated locations of chromosomal alterations, but also for discovering potential treatments for the underlying diseases.

SUMMARY OF THE INVENTION

Methods, systems and computer readable media for discovering a combination of treatments to reduce the progress of, or eliminate a tissue malady, are provided to include the steps of: (a) measuring gene expression values of at least one sample of tissue exhibiting the tissue malady and at least one reference sample tissue that does not exhibit the malady, using at least one CGH array designed to measure gene sequences and possible variations in gene sequences attributable to the malady; (b) generating gene expression signatures from differential expression values of ratios of the measured gene expression values between the at least one sample exhibiting the malady and the at least one reference sample, across all samples, respectively; (c) treating the at least one tissue sample exhibiting the malady with a treatment; (d) measuring a treatment-response value with respect to each of the tissue samples treated, as effected by the treatment; (e) generating a phenotypic signature representing the treatment-response values of each of the tissue samples treated; (f) repeating steps (c)-(e) with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments; (g) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (h) selecting treatments by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the at least one tissue sample exhibiting the malady.
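Steps (b), (g) and (h) above can be sketched in simplified form as follows. This is an illustration under stated assumptions, not the claimed method: a Pearson correlation threshold stands in for the clustering operation, log2 ratios stand in for the differential expression values, and every name and data value (`gene_A`, `treatment_X`, `min_corr`, and so on) is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def expression_signature(diseased, reference):
    """Step (b): log2 ratio of diseased to reference value, one per sample."""
    return [math.log2(d / r) for d, r in zip(diseased, reference)]

def select_treatments(gene_signatures, phenotypic_signatures, min_corr=0.8):
    """Steps (g)-(h), simplified: keep each treatment whose phenotypic
    signature correlates with at least one gene expression signature.
    Correlation thresholding is an assumed stand-in for clustering."""
    selected = {}
    for treatment, response in phenotypic_signatures.items():
        matches = [gene for gene, sig in gene_signatures.items()
                   if pearson(sig, response) >= min_corr]
        if matches:
            selected[treatment] = matches
    return selected

# Hypothetical measurements across four diseased samples.
reference = [1.0, 1.0, 1.0, 1.0]
gene_signatures = {
    "gene_A": expression_signature([4.0, 2.0, 8.0, 1.0], reference),
    "gene_B": expression_signature([1.0, 4.0, 0.5, 2.0], reference),
}
# Hypothetical treatment-response values, one per treated sample.
phenotypic_signatures = {
    "treatment_X": [0.9, 0.5, 1.3, 0.1],
    "treatment_Y": [0.2, 0.8, 0.1, 0.5],
    "treatment_Z": [0.5, 0.4, 0.5, 0.6],
}
selected = select_treatments(gene_signatures, phenotypic_signatures)
print(selected)
```

Here each signature is a vector with one entry per sample, so gene expression signatures and phenotypic signatures are directly comparable. A treatment is retained when its response pattern across the samples tracks the differential expression pattern of at least one gene.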

Methods, systems and computer readable media are provided for screening a combination of treatments to select treatments for tissue exhibiting a malady, to include the steps of: (a) providing differential expression levels of tissue samples exhibiting the malady relative to at least one reference tissue sample from respective features of CGH arrays designed to measure gene sequences and possible variations in gene sequences attributable to the malady; (b) for respective differential expression levels from respective features of respective CGH arrays for each tissue sample exhibiting the malady, providing a gene expression signature representing the differential expression level for each tissue sample for gene expression levels from that feature, respectively; (c) providing a treatment-response value, for each tissue sample exhibiting the malady having been treated with a treatment, as effected by the treatment; (d) generating a phenotypic signature representing the treatment-response values of each of the tissue samples having been treated; (e) repeating steps (c)-(d) with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments; (f) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (g) selecting treatments by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the tissue samples exhibiting the malady.

Methods, systems and computer readable media for augmenting an original or existing single treatment or treatment combination for a disease with at least one additional treatment that covers gene activity of the disease not addressed by the original or existing treatment are provided to include the steps of: (a) providing differential expression levels of diseased tissue samples relative to at least one reference tissue for respective features of CGH arrays designed to measure gene sequences and possible variations in gene sequences attributable to the disease; (b) for respective features of respective CGH arrays for each diseased tissue sample, providing a gene expression signature representing the differential expression level for each tissue sample for that feature, respectively; (c) treating the diseased tissue samples with the original or existing single treatment or combination treatment; (d) measuring a treatment-response value with respect to each of the diseased tissue samples as effected by the original or existing single or combination treatment; (e) generating a phenotypic signature representing the treatment-response values of each of the diseased tissue samples as effected by the original or existing single or combination treatment; (f) treating the diseased tissue samples with a treatment that is not included in the original or existing single or combination treatment; (g) measuring a treatment-response value with respect to each of the diseased tissue samples as effected by the treatment that is not included in the original or existing single or combination treatment; (h) generating a phenotypic signature representing the treatment-response values of each of the diseased tissue samples as effected by the treatment that is not included in the original or existing single or combination treatment; (i) repeating steps (f)-(h) with a different treatment that is also not included in the original or existing single or combination treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments not included in the original or existing single or combination treatment; (j) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (k) selecting at least one treatment by identifying the treatment-response phenotypic signatures caused by the at least one treatment, and which are clustered with phenotypic signatures identifying the treatment-response phenotypic signatures caused by the treatment or treatments in the original treatment, as well as with gene expression signatures representing differential expression levels representative of the diseased tissue samples, but separated from the phenotypic signatures identifying the treatment-response phenotypic signatures caused by the treatment or treatments in the original treatment, so as to address disease-gene activity not currently addressed by the treatment or treatments in the original or existing treatment.
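One simplified reading of the augmentation idea is sketched below: find disease gene signatures that the original treatment does not track, then pick candidate treatments whose response signatures do track them. As assumptions for illustration only, a Pearson correlation threshold replaces the clustering operation of step (j), and all names and values (`original_drug`, `candidate_1`, `min_corr`, the signature vectors) are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def uncovered_genes(gene_signatures, original_signatures, min_corr=0.8):
    """Gene signatures that no original-treatment signature correlates with."""
    return {
        gene: sig
        for gene, sig in gene_signatures.items()
        if all(pearson(sig, resp) < min_corr
               for resp in original_signatures.values())
    }

def augmenting_treatments(gene_signatures, original_signatures,
                          candidate_signatures, min_corr=0.8):
    """Candidates whose response tracks a gene the original treatment misses."""
    gaps = uncovered_genes(gene_signatures, original_signatures, min_corr)
    picks = {}
    for name, resp in candidate_signatures.items():
        hits = [g for g, sig in gaps.items() if pearson(sig, resp) >= min_corr]
        if hits:
            picks[name] = hits
    return picks

# Hypothetical signatures, one value per diseased tissue sample.
genes = {"gene_A": [2.0, 1.0, 3.0, 0.0], "gene_B": [0.0, 2.0, -1.0, 1.0]}
original = {"original_drug": [0.9, 0.5, 1.3, 0.1]}
candidates = {
    "candidate_1": [0.2, 0.8, 0.1, 0.5],
    "candidate_2": [1.0, 0.6, 1.4, 0.2],
}
picks = augmenting_treatments(genes, original, candidates)
print(picks)
```

In this example the original treatment already tracks `gene_A`, so `candidate_2`, which also tracks only `gene_A`, adds nothing new, while `candidate_1` is picked because it tracks the otherwise unaddressed `gene_B`.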

These and other advantages and features of the invention will become apparent to those persons skilled in the art upon reading the details of the methods, systems and computer readable media as more fully described below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a block diagram of a cancer transcriptome model.

FIG. 1B shows a modification of the model of FIG. 1A.

FIG. 2 is a flowchart illustrating steps that may be performed for determining potential treatments for diseases or other maladies that may be caused by altered mRNA resulting from chromosome alteration.

FIG. 3 shows exemplary plots of gene expression signatures of measurements of genes in altered or abnormal tissue, relative to “normal” tissue across multiple samples.

FIG. 4 shows an example of a gene expression signature plotted together with a phenotypic response signature or profile.

FIG. 5 shows a matrix in which data points used to generate gene expression signatures, phenotypic response signatures and inverted phenotypic response signatures have been inserted as cell values of the matrix.

FIG. 6 illustrates plots of three treatment/phenotypic profiles from samples having been treated with three different treatments that were determined to be in synchronization with the gene expression profile shown.

FIG. 7 is a schematic representation of an ellipsoid that represents the plot of a cluster of vectors from a matrix, such as the matrix shown in FIG. 5. As noted, this is a schematic representation, as, in reality, data ellipsoids are hyper-dimensional with complicated radial and angular geometries.

FIG. 8 is a block diagram illustrating an example of a computer system that may be employed in carrying out the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Before the present systems, methods and computer readable media are described, it is to be understood that this invention is not limited to particular treatments, drugs, diseases, methods, method steps, statistical methods, hardware or software described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a compound” includes a plurality of such compounds and reference to “the expression profile” includes reference to one or more expression profiles and equivalents thereof known to those skilled in the art, and so forth.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

DEFINITIONS

A “genotype” refers to the actual makeup of one or more genes (DNA) in living tissue. A genotypic signature is a textual or electronic representation that directly identifies the genotype.

A “phenotype” is related to a genotype, in that it is some sort of physical expression resulting from a blueprint provided by the genotype. A phenotypic signature is a textual or electronic representation of values representing the expression that defines the phenotype. mRNA expression might be considered to be either a genotype or phenotype, as it is in a gray area where the genotype executes the phenotype. It is referred to herein as genotype/phenotype.

A “treatment” refers to the administration of an agent to living tissue (generally a diseased tissue) that has some measurable effect on protein production by that tissue, which effect can be inferred by measurement of gene expression levels of the tissue, using microarray technology. “Treatments” may refer to, but are not limited to drugs, compounds, genetic sequences used to target specific locations of the genetic makeup of the tissue, radiation, heat, cryogenics, or any other kind of application that produces an effect as described above.

The term “transcriptome” refers to the set of all messenger RNA (mRNA) (or transcripts) in one or a population of biological cells.

The term “altered transcriptome” refers to a transcriptome resulting from a tissue sample containing one or more regions of abnormality in one or more chromosome (e.g., amplification, deletion). A “cancer transcriptome” refers to the altered transcriptome of a biological cell sample that is cancerous.

A “biopolymer” is a polymer of one or more types of repeating units. Biopolymers are typically found in biological systems and particularly include polysaccharides (such as carbohydrates), and peptides (which term is used to include polypeptides and proteins) and polynucleotides as well as their analogs such as those compounds composed of or containing amino acid analogs or non-amino acid groups, or nucleotide analogs or non-nucleotide groups. This includes polynucleotides in which the conventional backbone has been replaced with a non-naturally occurring or synthetic backbone, and nucleic acids (or synthetic or naturally occurring analogs) in which one or more of the conventional bases has been replaced with a group (natural or synthetic) capable of participating in Watson-Crick type hydrogen bonding interactions. Polynucleotides include single or multiple stranded configurations, where one or more of the strands may or may not be completely aligned with another.

A “nucleotide” refers to a sub-unit of a nucleic acid and has a phosphate group, a 5 carbon sugar and a nitrogen containing base, as well as functional analogs (whether synthetic or naturally occurring) of such sub-units which in the polymer form (as a polynucleotide) can hybridize with naturally occurring polynucleotides in a sequence specific manner analogous to that of two naturally occurring polynucleotides. For example, a “biopolymer” includes DNA (including cDNA), RNA, oligonucleotides, and PNA and other polynucleotides as described in U.S. Pat. No. 5,948,902 and references cited therein (all of which are incorporated herein by reference), regardless of the source. An “oligonucleotide” generally refers to a nucleotide multimer of about 10 to 100 nucleotides in length, while a “polynucleotide” includes a nucleotide multimer having any number of nucleotides. A “biomonomer” references a single unit, which can be linked with the same or other biomonomers to form a biopolymer (for example, a single amino acid or nucleotide with two linking groups one or both of which may have removable protecting groups).

A “chemical array”, “microarray”, “bioarray” or “array”, unless a contrary intention appears, includes any one, two or three-dimensional arrangement of addressable regions bearing a particular chemical moiety or moieties (for example, biopolymers such as polynucleotide sequences) associated with that region, where the chemical moiety or moieties are immobilized on the surface in that region. By “immobilized” is meant that the moiety or moieties are stably associated with the substrate surface in the region, such that they do not separate from the region under conditions of using the array, e.g., hybridization and washing and stripping conditions. As is known in the art, the moiety or moieties may be covalently or non-covalently bound to the surface in the region. For example, each region may extend into a third dimension in the case where the substrate is porous while not having any substantial third dimension measurement (thickness) in the case where the substrate is non-porous. An array may contain more than ten, more than one hundred, more than one thousand, more than ten thousand, or even more than one hundred thousand features, in an area of less than 20 cm² or even less than 10 cm². For example, features may have widths (that is, diameter, for a round spot) in the range of from about 10 μm to about 1.0 cm. In other embodiments each feature may have a width in the range of about 1.0 μm to about 1.0 mm, such as from about 5.0 μm to about 500 μm, and including from about 10 μm to about 200 μm. Non-round features may have area ranges equivalent to that of circular features with the foregoing width (diameter) ranges. A given feature is made up of chemical moieties, e.g., nucleic acids, that bind to (e.g., hybridize to) the same target (e.g., target nucleic acid), such that a given feature corresponds to a particular target.
At least some, or all, of the features are of different compositions (for example, when any repeats of each feature composition are excluded the remaining features may account for at least 5%, 10%, or 20% of the total number of features). Interfeature areas will typically (but not essentially) be present which do not carry any polynucleotide. Such interfeature areas typically will be present where the arrays are formed by processes involving drop deposition of reagents but may not be present when, for example, light directed synthesis fabrication processes are used. It will be appreciated though, that the interfeature areas, when present, could be of various sizes and configurations. An array is “addressable” in that it has multiple regions (sometimes referenced as “features” or “spots” of the array) of different moieties (for example, different polynucleotide sequences) such that a region at a particular predetermined location (an “address”) on the array will detect a particular target or class of targets (although a feature may incidentally detect non-targets of that feature). The target for which each feature is specific is, in representative embodiments, known. An array feature is generally homogenous in composition and concentration and the features may be separated by intervening spaces (although arrays without such separation can be fabricated).

In the case of an array, the “target” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by probes (“target probes”) which are bound to the substrate at the various regions. However, either of the “target” or “target probes” may be the one which is to be detected by the other (thus, either one could be an unknown mixture of polynucleotides to be detected by binding with the other). “Addressable sets of probes” and analogous terms refer to the multiple regions of different moieties supported by or intended to be supported by the array surface.

The term “sample” as used herein relates to a material or mixture of materials, containing one or more components of interest. Samples include, but are not limited to, samples obtained from an organism or from the environment (e.g., a soil sample, water sample, etc.) and may be directly obtained from a source (e.g., from a biopsy or from a tumor) or indirectly obtained, e.g., after culturing and/or one or more processing steps. In one embodiment, samples are a complex mixture of molecules, e.g., comprising at least about 50 different molecules, at least about 100 different molecules, at least about 200 different molecules, at least about 500 different molecules, at least about 1000 different molecules, at least about 5000 different molecules, at least about 10,000 different molecules, etc. Given the significant, somewhat chaotic alteration of the genome by cancer, one expects some abnormal distribution in the expression profiles of those tissue samples affected by cancer.

The term “genome” refers to all nucleic acid sequences (coding and non-coding) and elements present in any virus, single cell (prokaryote and eukaryote) or each cell type in a metazoan organism. The term genome also applies to any naturally occurring or induced variation of these sequences that may be present in a mutant or disease variant of any virus or cell type. These sequences include, but are not limited to, those involved in the maintenance, replication, segregation, and higher order structures (e.g. folding and compaction of DNA in chromatin and chromosomes), or other functions, if any, of the nucleic acids as well as all the coding regions and their corresponding regulatory elements needed to produce and maintain each particle, cell or cell type in a given organism.

For example, the human genome consists of approximately 3.0×10⁹ base pairs of DNA organized into distinct chromosomes. The genome of a normal diploid somatic human cell consists of 22 pairs of autosomes (chromosomes 1 to 22) and either chromosomes X and Y (males) or a pair of X chromosomes (females) for a total of 46 chromosomes. A genome of a cancer cell may contain variable numbers of each chromosome in addition to deletions, rearrangements and amplification of any sub-chromosomal region or DNA sequence. In certain aspects, a “genome” refers to nuclear nucleic acids, excluding mitochondrial nucleic acids; however, in other aspects, the term does not exclude mitochondrial nucleic acids. In still other aspects, the “mitochondrial genome” is used to refer specifically to nucleic acids found in mitochondrial fractions.

By “genomic source” is meant the initial nucleic acids that are used as the original nucleic acid source from which the probe nucleic acids are produced, e.g., as a template in the nucleic acid amplification and/or labeling protocols.

If a surface-bound polynucleotide or probe “corresponds to” a chromosomal region, the polynucleotide usually contains a sequence of nucleic acids that is unique to that chromosomal region. Accordingly, a surface-bound polynucleotide that corresponds to a particular chromosomal region usually specifically hybridizes to a labeled nucleic acid made from that chromosomal region, relative to labeled nucleic acids made from other chromosomal regions.

An “array layout” or “array characteristics”, refers to one or more physical, chemical or biological characteristics of the array, such as positioning of some or all the features within the array and on a substrate, one or more feature dimensions, or some indication of an identity or function (for example, chemical or biological) of a moiety at a given location, or how the array should be handled (for example, conditions under which the array is exposed to a sample, or array reading specifications or controls following sample exposure).

The phrase “oligonucleotide bound to a surface of a solid support” or “probe bound to a solid support” or a “target bound to a solid support” refers to an oligonucleotide or mimetic thereof, e.g., PNA, LNA or UNA molecule that is immobilized on a surface of a solid substrate, where the substrate can have a variety of configurations, e.g., a sheet, bead, particle, slide, wafer, web, fiber, tube, capillary, microfluidic channel or reservoir, or other structure. In certain embodiments, the collections of oligonucleotide elements employed herein are present on a surface of the same planar support, e.g., in the form of an array. It should be understood that the terms “probe” and “target” are relative terms and that a molecule considered as a probe in certain assays may function as a target in other assays.

As used herein, a “test nucleic acid sample” or “test nucleic acids” refers to nucleic acids comprising sequences whose quantity or degree of representation (e.g., copy number) or sequence identity is being assayed. Similarly, “test genomic acids” or a “test genomic sample” refers to genomic nucleic acids comprising sequences whose quantity or degree of representation (e.g., copy number) or sequence identity is being assayed.

As used herein, a “reference nucleic acid sample” or “reference nucleic acids” refers to nucleic acids comprising sequences whose quantity or degree of representation (e.g., copy number) or sequence identity is known. Similarly, “reference genomic acids” or a “reference genomic sample” refers to genomic nucleic acids comprising sequences whose quantity or degree of representation (e.g., copy number) or sequence identity is known. A “reference nucleic acid sample” may be derived independently from a “test nucleic acid sample,” i.e., the samples can be obtained from different organisms or different cell populations of the same organism. However, in certain embodiments, a reference nucleic acid is present in a “test nucleic acid sample” which comprises one or more sequences whose quantity or identity or degree of representation in the sample is unknown while containing one or more sequences (the reference sequences) whose quantity or identity or degree of representation in the sample is known. The reference nucleic acid may be naturally present in a sample (e.g., present in the cell from which the sample was obtained) or may be added to or spiked in the sample.

If a surface-bound polynucleotide or probe “corresponds to” a chromosome, the polynucleotide usually contains a sequence of nucleic acids that is unique to that chromosome. Accordingly, a surface-bound polynucleotide that corresponds to a particular chromosome usually specifically hybridizes to a labeled nucleic acid made from that chromosome, relative to labeled nucleic acids made from other chromosomes. Array features, because they usually contain surface-bound polynucleotides, can also correspond to a chromosome.

A “non-cellular chromosome composition” is a composition of chromosomes synthesized by mixing pre-determined amounts of individual chromosomes. These synthetic compositions can include selected concentrations and ratios of chromosomes that do not naturally occur in a cell, including any cell grown in tissue culture. Non-cellular chromosome compositions may contain more than an entire complement of chromosomes from a cell, and, as such, may include extra copies of one or more chromosomes from that cell. Non-cellular chromosome compositions may also contain less than the entire complement of chromosomes from a cell.

“CGH” or “Comparative Genomic Hybridization” refers generally to techniques for identification of chromosomal alterations (such as in cancer cells, for example). Using CGH, ratios between tumor or test sample and normal or control sample enable the detection of chromosomal amplifications and deletions of regions that may include oncogenes and tumor suppressor genes, for example.

A “CGH array” or “aCGH array” refers to an array that can be used to compare DNA samples for relative differences in copy number. In general, an aCGH array can be used in any assay in which it is desirable to scan a genome with a sample of nucleic acids. For example, an aCGH array can be used in location analysis as described in U.S. Pat. No. 6,410,243, the entirety of which is incorporated herein. In certain aspects, a CGH array provides probes for screening or scanning a genome of an organism and comprises probes from a plurality of regions of the genome. In one aspect, the array comprises probe sequences for scanning an entire chromosome arm, wherein probe targets are separated by at least about 500 bp, at least about 1 kb, at least about 5 kb, at least about 10 kb, at least about 25 kb, at least about 50 kb, at least about 100 kb, at least about 250 kb, at least about 500 kb or at least about 1 Mb. In another aspect, the array comprises probe sequences for scanning an entire chromosome, a set of chromosomes, or the complete complement of chromosomes forming the organism's genome. By “resolution” is meant the spacing on the genome between sequences found in the probes on the array. In some embodiments (e.g., using a large number of probes of high complexity) all sequences in the genome can be present in the array. The spacing between different locations of the genome that are represented in the probes may also vary, and may be uniform, such that the spacing is substantially the same between sampled regions, or non-uniform, as desired. An assay performed at low resolution on one array, e.g., comprising probe targets separated by larger distances, may be repeated at higher resolution on another array, e.g., comprising probe targets separated by smaller distances.
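The notion of resolution as probe spacing can be made concrete with a small sketch. The region length, the two spacings and the function names below are hypothetical and chosen only to illustrate the low-resolution and higher-resolution cases described above.

```python
def uniform_probe_starts(region_start, region_end, spacing_bp):
    """Start coordinates of probe targets placed every spacing_bp bases."""
    return list(range(region_start, region_end, spacing_bp))

def spacings(positions):
    """Distances between consecutive probe targets (positions must be sorted)."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical 1 Mb region scanned at low (100 kb) and higher (25 kb) resolution.
low = uniform_probe_starts(0, 1_000_000, 100_000)
high = uniform_probe_starts(0, 1_000_000, 25_000)
print(len(low), len(high))
print(set(spacings(low)), set(spacings(high)))
```

Under these assumptions the 25 kb layout places four times as many probe targets in the same region as the 100 kb layout, which is the trade-off between probe count and the ability to localize an alteration. A non-uniform layout would simply be a sorted list of positions with varying spacings.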

In certain aspects, in constructing the arrays, both coding and non-coding genomic regions are included as probes, whereby “coding region” refers to a region comprising one or more exons that is transcribed into an mRNA product and from there translated into a protein product, while by non-coding region is meant any sequences outside of the exon regions, where such regions may include regulatory sequences, e.g., promoters, enhancers, untranslated but transcribed regions, introns, origins of replication, telomeres, etc. In certain embodiments, one can have at least some of the probes directed to non-coding regions and others directed to coding regions. In certain embodiments, one can have all of the probes directed to non-coding sequences. In certain embodiments, one can have all of the probes directed to coding sequences. In certain other aspects, individual probes comprise sequences that do not normally occur together, e.g., to detect gene rearrangements, for example, as in a case where the mangled or abnormal genome, resulting from cancer, is expected to be less successful at expressing the normal distribution of mRNA that is normally observed for tissue of the same type that has not been affected by cancer, and therefore also less successful at producing translation mRNA that is normally observed.

In some embodiments, at least 5% of the polynucleotide probes on the solid support hybridize to regulatory regions of a nucleotide sample of interest while other embodiments may have at least 30% of the polynucleotide probes on the solid support hybridize to exonic regions of a nucleotide sample of interest. In yet other embodiments, at least 50% of the polynucleotide probes on the solid support hybridize to intergenic (e.g., non-coding) regions of a nucleotide sample of interest. In certain aspects, probes on the array represent a random selection of genomic sequences (e.g., both coding and non-coding). However, in other aspects, particular regions of the genome are selected for representation on the array, such as CpG islands, genes belonging to particular pathways of interest or whose expression and/or copy number are associated with particular physiological responses of interest (e.g., disease, such as cancer, drug resistance, toxicological responses and the like). In certain aspects, where particular genes are identified as being of interest, intergenic regions proximal to those genes are included on the array along with, optionally, all or portions of the coding sequence corresponding to the genes. In one aspect, at least about 100 bp, 500 bp, 1,000 bp, 5,000 bp, 10,000 kb or even 100,000 kb of genomic DNA upstream of a transcriptional start site is represented on the array in discrete or overlapping sequence probes. In certain aspects, at least one probe sequence comprises a motif sequence to which a protein of interest (e.g., such as a transcription factor) is known or suspected to bind.

In certain aspects, repetitive sequences are excluded as probes on the arrays. However, in another aspect, repetitive sequences are included.

The choice of nucleic acids to use as probes may be influenced by prior knowledge of the association of a particular chromosome or chromosomal region with certain disease conditions. International Application WO 93/18186 provides a list of exemplary chromosomal abnormalities and associated diseases, which are described in the scientific literature. Alternatively, whole genome screening to identify new regions subject to frequent changes in copy number can be performed using the methods of the present invention discussed further below.

In some embodiments, previously identified regions from a particular chromosomal region of interest are used as probes. In certain embodiments, the array can include probes which “tile” a particular region (e.g., which have been identified in a previous assay or from a genetic analysis of linkage), by which is meant that the probes correspond to a region of interest as well as genomic sequences found at defined intervals on either side, i.e., 5′ and 3′ of, the region of interest, where the intervals may or may not be uniform, and may be tailored with respect to the particular region of interest and the assay objective. In other words, the tiling density may be tailored based on the particular region of interest and the assay objective. Such “tiled” arrays and assays employing the same are useful in a number of applications, including applications where one identifies a region of interest at a first resolution, and then uses a tiled array tailored to the initially identified region to further assay the region at a higher resolution, e.g., in an iterative protocol.

In certain aspects, the array includes probes to sequences associated with diseases associated with chromosomal imbalances for prenatal testing. For example, in one aspect, the array comprises probes complementary to all or a portion of chromosome 21 (e.g., to detect Down's syndrome), all or a portion of the X chromosome (e.g., to detect an X chromosome deficiency as in Turner's Syndrome) and/or all or a portion of the Y chromosome (e.g., to detect Klinefelter Syndrome, i.e., duplication of an X chromosome and the presence of a Y chromosome), all or a portion of chromosome 7 (e.g., to detect William's Syndrome), all or a portion of chromosome 8 (e.g., to detect Langer-Giedion Syndrome), all or a portion of chromosome 15 (e.g., to detect Prader-Willi or Angelman's Syndrome), and/or all or a portion of chromosome 22 (e.g., to detect Di George's syndrome).

Other “themed” arrays may be fabricated, for example, arrays including probes for sequences whose duplications or deletions are associated with specific types of cancer (e.g., breast cancer, prostate cancer and the like). The selection of such arrays may be based on patient information such as familial inheritance of particular genetic abnormalities. In certain aspects, an array for scanning an entire genome is first contacted with a sample and then a higher-resolution array is selected based on the results of such scanning.

Themed arrays also can be fabricated for use in gene expression assays, for example, to detect expression of genes involved in selected pathways of interest, or genes associated with particular diseases of interest.

In one embodiment, a plurality of probes on the array is selected to have a duplex T_(m) within a predetermined range. For example, in one aspect, at least about 50% of the probes have a duplex T_(m) within a temperature range of about 75° C. to about 85° C. In one embodiment, at least 80% of said polynucleotide probes have a duplex T_(m) within a temperature range of about 75° C. to about 85° C., within a range of about 77° C. to about 83° C., within a range of from about 78° C. to about 82° C. or within a range from about 79° C. to about 82° C. In one aspect, at least about 50% of probes on an array have a range of T_(m)'s of less than about 4° C., less than about 3° C., or even less than about 2° C., e.g., less than about 1.5° C., less than about 1.0° C. or about 0.5° C.
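The T_(m) criteria above can be checked against a candidate probe set as sketched below. The example assumes that a duplex T_(m) value has already been obtained for each probe by some other means; the function names and the sample values are hypothetical, and the window test uses an inclusive width for simplicity.

```python
def fraction_in_range(tms, low=75.0, high=85.0):
    """Fraction of probes whose duplex Tm (deg C) lies within [low, high]."""
    inside = sum(1 for t in tms if low <= t <= high)
    return inside / len(tms)


def best_window_fraction(tms, width):
    """Largest fraction of probes whose Tm values span no more than `width` deg C."""
    ordered = sorted(tms)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it is no wider than `width`.
        while ordered[end] - ordered[start] > width:
            start += 1
        best = max(best, end - start + 1)
    return best / len(ordered)


# Hypothetical duplex Tm values (deg C) for ten probes.
probe_tms = [74.0, 76.5, 79.0, 80.0, 81.5, 84.0, 86.0, 80.5, 79.5, 78.0]

print(fraction_in_range(probe_tms))          # share of probes within 75-85 deg C
print(best_window_fraction(probe_tms, 4.0))  # share of probes within a 4 deg C window
```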

The probes on the microarray, in certain embodiments have a nucleotide length in the range of at least 30 nucleotides to 200 nucleotides, or in the range of at least about 30 to about 150 nucleotides. In other embodiments, at least about 50% of the polynucleotide probes on the solid support have the same nucleotide length, and that length may be about 60 nucleotides.

In certain aspects, longer polynucleotides may be used as probes. In addition to the oligonucleotide probes described above, cDNAs, or inserts from phage, BACs (bacterial artificial chromosomes) or plasmid clones, can be arrayed. Probes may therefore also range from about 201-5000 bases in length, from about 5001-50,000 bases in length, or from about 50,001-200,000 bases in length, depending on the platform used. If other polynucleotide features are present on a subject array, they may be interspersed with, or in a separately-hybridizable part of the array from, the subject oligonucleotides.

In still other aspects, probes on the array comprise at least coding sequences.

In one aspect, probes represent sequences from an organism such as Drosophila melanogaster, Caenorhabditis elegans, yeast, zebrafish, a mouse, a rat, a domestic animal, a companion animal, a primate, a human, etc. In certain aspects, probes representing sequences from different organisms are provided on a single substrate, e.g., on a plurality of different arrays.

A “CGH assay” using an aCGH array can be generally performed as follows. In one embodiment, a population of nucleic acids contacted with an aCGH array comprises at least two sets of nucleic acid populations, which can be derived from different sample sources. For example, in one aspect, a target population contacted with the array comprises a set of target molecules from a reference sample and from a test sample. In one aspect, the reference sample is from an organism having a known genotype and/or phenotype, while the test sample has an unknown genotype and/or phenotype or a genotype and/or phenotype that is known and is different from that of the reference sample. For example, in one aspect, the reference sample is from a healthy patient while the test sample is from a patient suspected of having cancer or known to have cancer.

In one embodiment, a target population being contacted to an array in a given assay comprises at least two sets of target populations that are differentially labeled (e.g., by spectrally distinguishable labels). In one aspect, control target molecules in a target population are also provided as two sets, e.g., a first set labeled with a first label and a second set labeled with a second label corresponding to first and second labels being used to label reference and test target molecules, respectively.

In one aspect, the control target molecules in a population are present at a level comparable to a haploid amount of a gene represented in the target population. In another aspect, the control target molecules are present at a level comparable to a diploid amount of a gene. In still another aspect, the control target molecules are present at a level that is different from a haploid or diploid amount of a gene represented in the target population. The relative proportions of complexes formed labeled with the first label vs. the second label can be used to evaluate relative copy numbers of targets found in the two samples.

In certain aspects, test and reference populations of nucleic acids may be applied separately to separate but identical arrays (e.g., having identical probe molecules) and the signals from each array can be compared to determine relative copy numbers of the nucleic acids in the test and reference populations.

Arrays can be fabricated using drop deposition from pulse jets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351, U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. As already mentioned, these references are incorporated herein by reference. Other drop deposition methods can be used for fabrication, as previously described herein. Also, instead of drop deposition methods, photolithographic array fabrication methods may be used. Interfeature areas need not be present particularly when the arrays are made by photolithographic methods as described in those patents.

Following receipt by a user, an array will typically be exposed to a sample (for example, a fluorescently labeled polynucleotide or protein containing sample) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at multiple regions on each feature of the array. For example, a scanner may be used for this purpose which is similar to the AGILENT MICROARRAY SCANNER manufactured by Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. patent applications: Ser. No. 10/087,447 “Reading Dry Chemical Arrays Through The Substrate” by Corson et al.; and in U.S. Pat. Nos. 6,518,556; 6,486,457; 6,406,849; 6,371,370; 6,355,921; 6,320,196; 6,251,685; and 6,222,664. However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. No. 6,251,685, U.S. Pat. No. 6,221,583 and elsewhere). A result obtained from the reading may be used in accordance with the techniques of the present invention in screening and finding multiple drug treatment therapies. A result of the reading (whether further processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).

A “gene expression profile”, “expression profile”, “gene expression signature” or “expression signature” as used herein refers to a serial pattern (e.g., time series data) generated from a plot of gene expression values for a gene across all samples processed for gene expression. For example, ten different cell lines may be processed on ten different arrays. A gene expression profile for a particular gene “A” is generated by plotting the differential gene expression values for gene “A”, versus a control or reference, for each of the ten cell lines against those cell lines, and then forming a serial pattern to define the expression profile.

When one item is indicated as being “remote” from another, this means that the two items are in different locations, e.g., in different rooms or buildings, and may be at least one mile, ten miles, or at least one hundred miles apart.

“Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data.

A “processor” references any hardware and/or software combination which will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of a mainframe, server, or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic or optical disk may carry the programming, and can be read by a suitable disk reader communicating with each processor at its corresponding station.

Reference to a singular item includes the possibility that there are plural of the same items present.

“May” means optionally.

Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as the recited order of events.

CGH arrays may be custom designed to map out chromosome alterations, as noted above. As a result of chromosome alteration, there is a resultant alteration in transcripts (mRNA) produced by the alteration in genetic material on the chromosome. The altered mRNA in turn causes an abnormal or altered protein expression, relative to what would be seen in tissue having normal, non-altered chromosomes, which altered expression can lead to diseases and other abnormalities in the subject from which the tissue having altered chromosomal material originated.

FIG. 1A is a block diagram 100 of a cancer transcriptome model. The abnormal or altered mRNA resulting from the altered regions of the chromosomes of the cancerous tissue is at least a factor in causing the formation of a cancer tumor. The cancerous transcriptome can be characterized as a tumor gene expression profile. The tumor gene expression profile includes tumor gene expression 102 that results in altered or abnormal protein expression 104. The abnormal protein expression 104 influences, causes, enhances and/or continues tumor growth 106.

Because the field of proteomics is at an early stage, it is not currently sophisticated enough to analyze the protein expression 104 causally with respect to tumor growth 106. However, based on the model 100 of FIG. 1A, the present invention derives model 110 in FIG. 1B, which assumes that there is a proportional connection/relationship between tumor gene expression and tumor growth 106.

FIG. 2 is a flowchart 200 illustrating steps that may be performed for determining potential treatments for diseases or other maladies that may be caused by altered mRNA resulting from chromosome alteration. By designing one or more CGH arrays to effectively map the chromosomes of the tissue to be studied, gene sequences and possible variations in gene sequences can be predicted based on location analysis performed from results outputted by such CGH arrays. CGH implies what kinds of modifications, if any, have occurred in the chromosomes, such as amplifications and/or deletions. Based on these modifications, prediction of a comprehensive set of candidate mRNA sequences that may be expressed by the altered chromosome or chromosomes is made possible. For example, initially an assumption may be made that transcription initializers (e.g., promoters/primers or transcription factors that act upstream of the exons in enabling transcription) have not been altered, but have possibly been relocated. Consequently, a prediction of the transcriptions with altered sequences is made. Such transcriptions will tend to be foreign to editing processes, likely producing mangled (abnormal) RNA sequences. These abnormal sequences may modify cell processes in many ways, including abnormal translations and/or interference with other cellular processes. If transcription initializers are altered, then it is likely that the genes that originate the same are missing, or altered in some other way so as to render them dysfunctional. The custom array is designed to measure the altered transcriptome. For example, for studying a particular type of cancer, the custom array would be designed to measure the predicted cancer transcriptome of the particular cancer being studied, as well as expected mRNA sequences for “normal” tissues of the same type that have not been affected by cancer. There may be considerable variety in the alterations of the chromosomes for a particular malady being studied. For example, alterations may vary considerably in different stages of cancer, etc. Therefore, the custom array may be designed to capture as many as possible of the potential variations in mRNA sequences that may be outputted by the altered tissue, given the potential variations in alteration of the chromosomes.

For treatment of cancer or other disease or malady causally related to chromosome alteration, it may be useful to know not only the locations where the chromosome(s) have been altered, but also what transcripts are expressed that may be causally related to the disease. For example, certain transcripts in a cancer transcriptome may cause the production of proteins that are necessary to allow a tumor to exist and/or proliferate.

After designing a custom array to measure the potential mRNA sequences that may be expressed in the abnormal tissue to be studied at event 202, or, alternatively, being provided with such a custom array, each cell line of the altered or abnormal tissues is run on one of such custom arrays to measure the altered transcriptome at event 204 via the probes on the array that are designed for mutations in mRNA predicted to be present in the altered tissues, as noted. Based on the measured differential expression ratios, gene expression signatures are generated at event 206. Thus, each differential expression of each gene measured in the altered tissue, relative to the “normal” tissue is measured and plotted, to generate gene expression signatures 302 across the cell lines, as exemplified in FIG. 3. Note that for simplicity in illustration and explanation, only three such gene expression signatures 302 (i.e., 302 a, 302 b, 302 c) are shown in FIG. 3. Typically, there may be thousands, if not tens of thousands of such gene expression signatures 302 generated.
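A minimal sketch of generating gene expression signatures from differential expression ratios, as described for event 206, is given below. The dictionary layout, the gene names, the measurement values and the use of a log2 transform of the ratio are assumptions made for illustration only.

```python
import math


def gene_signatures(abnormal, normal):
    """Build one signature per gene: log2(abnormal / normal) across cell lines.

    Both arguments map a gene name to a list of expression values, one value
    per cell line, in the same cell-line order (a hypothetical data layout).
    """
    return {
        gene: [math.log2(a / n) for a, n in zip(abnormal[gene], normal[gene])]
        for gene in abnormal
    }


# Hypothetical measurements for two genes across three cell lines.
abnormal = {"geneA": [8.0, 2.0, 1.0], "geneB": [1.0, 4.0, 0.5]}
normal = {"geneA": [2.0, 2.0, 2.0], "geneB": [2.0, 2.0, 2.0]}

for gene, signature in gene_signatures(abnormal, normal).items():
    print(gene, signature)
```

Each resulting list corresponds to one signature 302, with one value per cell line.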

Next, samples of each of the cell lines are provided with a treatment, and experimentation is performed to dose the treatment at a level that reduces the disease or malady by a predetermined amount at event 208. For example, for cancer lines, a dose of the treatment may be provided that reduces tumor growth by 50% in one day. Of course, this is only an example of the predetermined amount, as various other standards may be chosen. Alternatively, a predefined amount and duration of treatment may be provided, and the effect of such on each cell line may be measured. The treatment given may be a drug or compound or combinations of drugs or compounds, genetic sequences, radiation, thermal, electrical, magnetic, or other forms of treatment expected to reduce the disease state, or combinations of the same. Once the desired treatment dosage has been determined, this treatment dosage is applied to the cell lines and, at about the time over which the predetermined reduction amount has been determined to occur, the treated cell lines/samples are processed, each on one of the custom arrays, for example.

A phenotypic response profile resultant from the treatment is then generated from measuring effects of the treatment on the cell lines (phenotypic response) in event 210. If a predetermined amount of treatment was applied to the cell lines for a predetermined time, then the amount of reduction in tumor growth is measured with respect to each cell line, and those amount values make up the response profile. The phenotypic response profile shows the signature of the variation in the phenotypic response (in this example, tumor reduction) to the treatment across the various cell lines, as the treatment impacts disease-active proteins in the cell line samples, the levels of which disease-active proteins vary across samples. The production of the disease-active proteins is regulated by certain genes, as noted in the models of FIGS. 1A and 1B. If the treatment was provided to determine the amount of treatment that was required to reduce the tumor growth in the cell lines by a predetermined amount over a predetermined time, then the phenotypic response profile is generated by the amount of treatment required, with respect to each cell line, to reduce the tumor growth in that cell line by the predetermined amount, which varies across samples, thereby creating a meaningful phenotypic response profile for the treated samples/cell lines.
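The second way of forming the phenotypic response profile, in which each value is the amount of treatment required to reach the predetermined reduction, can be sketched as follows. The use of linear interpolation between tested doses, the function name and the dose-response values are assumptions for this example; the disclosure does not specify how the required dose is determined.

```python
def dose_for_reduction(doses, reductions, target=0.5):
    """Estimate the dose giving `target` fractional tumor-growth reduction.

    `doses` must be ascending and `reductions` non-decreasing. Linear
    interpolation between tested doses is an illustrative assumption.
    Returns None when the target reduction is never reached.
    """
    points = list(zip(doses, reductions))
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 <= target <= r1:
            if r1 == r0:
                return d0
            return d0 + (target - r0) * (d1 - d0) / (r1 - r0)
    return None


# Hypothetical dose-response measurements for three cell lines.
doses = [0.0, 1.0, 2.0, 4.0]
measured = {
    "line1": [0.0, 0.25, 0.75, 0.9],
    "line2": [0.0, 0.125, 0.25, 0.75],
    "line3": [0.0, 0.1, 0.2, 0.3],
}

# One value per cell line makes up the phenotypic response profile.
profile = {line: dose_for_reduction(doses, r) for line, r in measured.items()}
print(profile)
```

A cell line that never reaches the predetermined reduction yields None here, and how such a line is handled is left open in this sketch.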

The response profile (treatment-impact profile) 402 is next compared with the gene expression signatures 302 of the untreated cell lines to look for expression signatures 302 that may be in synchronization or one hundred eighty degrees out of synchronization (anti-synchronization) with the treatment impact profile 402. Note that only one gene expression signature 302 is represented in plot 400 of FIG. 4 for illustration of the comparison with the response profile 402. However, all gene expression signatures 302 may be compared in this manner. For each cell line, the measured value of the effect of the treatment is plotted, and response profile 402 is generated by interconnecting the plotted data points with straight lines, as shown in FIG. 4. Synchronization depends only on the waveform of two signatures that are being compared, and not on the scale of each signature. However, covariance similarity will be larger for signatures with larger amplitudes, whereas correlation will not be. Hence, covariance-based clustering may be more meaningful.
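The distinction drawn above between correlation and covariance can be illustrated with the short sketch below, which uses hypothetical signature values. It shows that correlation is unchanged when a signature is rescaled, that covariance grows with amplitude, and that an anti-synchronized signature yields a correlation of opposite sign.

```python
from statistics import mean


def covariance(x, y):
    """Population covariance of two equal-length signatures."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5


# Hypothetical signatures across five cell lines.
gene = [1.0, 3.0, 2.0, 5.0, 4.0]
response = [2.0 * v for v in gene]       # same waveform, amplitude doubled
big_response = [10.0 * v for v in gene]  # same waveform, amplitude times ten
anti_response = [-v for v in response]   # anti-synchronized waveform

print(correlation(gene, response), correlation(gene, big_response))
print(covariance(gene, response), covariance(gene, big_response))
print(correlation(gene, anti_response))
```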

Additionally, replicates may be run of each tissue sample (cell line) against each treatment, to compute an average response for each, to reduce the noise in the phenotypic signatures. Likewise, replicate CGH arrays may be processed to reduce noise in the gene expression signatures resulting therefrom by generating an average gene expression signature for each gene from the average of the responses in the replicates. Additionally or alternatively, phenotypic signatures may be processed using “self-self” prediction techniques such as described in co-pending, commonly owned application Ser. No. 10/400,372 filed Mar. 27, 2003 and titled “Method and System for Predicting Multi-Variable Outcomes”, and in Application Ser. No. 60/368,586 filed Mar. 29, 2002 and titled “Generalized Similarity Least Squares Predictor”, both of which are incorporated, in their entireties, by reference thereto.
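Replicate averaging as described above may be sketched as follows, assuming each replicate yields a signature with one value per cell line in the same order. The function name and the values are hypothetical.

```python
def average_replicates(replicates):
    """Average several replicate signatures position by position."""
    return [sum(values) / len(values) for values in zip(*replicates)]


# Three hypothetical replicate phenotypic signatures across four cell lines.
replicates = [
    [0.50, 0.20, 0.80, 0.40],
    [0.60, 0.10, 0.70, 0.50],
    [0.40, 0.30, 0.90, 0.30],
]

print(average_replicates(replicates))
```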

A phenotypic signature that is significantly in synchronization with or in anti-synchronization with a gene expression signature can be considered to be causally related to the gene from which the gene expression signature is derived. That is, when the treatment is applied, the resultant changes in the tissues (the phenotypic response) are proportional to the expression in that gene (from which the synchronized or anti-synchronized gene expression signature was generated) in the abnormal tissue relative to the expression of the same gene in the normal tissue, as it is hypothesized that a change (reduction or increase) in the proteins generated from expression of that gene has occurred in response to the treatment. Note that the phenotypic response, in this example, reduction of tumor growth, may be correlated with a decrease in diseased or otherwise abnormal (tissue malady) gene expression (synchronization), to cause a decrease in proteins that proliferate the disease, or an increase in gene expression (anti-synchronization) to cause an increase in proteins that inhibit tumor growth or to regulate (reduce) the production of proteins that cause the disease to proliferate. Although the phase link between the phenotypic response and gene expression signatures from diseased tissues is likely more complicated, since protein relationships are not able to be identified at this time, as noted elsewhere in this disclosure, approximations are made according to the techniques described herein based on identifying synchronized and anti-synchronized signatures, between phenotypic response signatures and gene expression signatures from tissues known to have a malady.

Typically, a plurality of treatments (with or without replicates) is provided to produce an equal number of phenotypic response signatures 402, to search for one or more potentially effective treatments for the abnormality being studied. To test for anti-synchronization, the phenotypic signatures are inverted.

FIG. 5 shows a matrix 500 in which the data points used to generate each of the gene expression signatures 302, phenotypic response signatures 402 and inverted phenotypic response signatures 402 i have been inserted as cell values of the matrix 500, with each row representing a signature, and each column representing one of the cell lines. Thus, after appropriate transformation (e.g., smoothing and adjustment for phase shifts, if known) and normalization, each treatment-induced phenotypic signature 402 and inverse signature 402 i in FIG. 5 is compared with differential-expression signatures 302, to map treatments to the transcriptome locations of the correlated genes in the abnormal tissue samples.
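The row layout of matrix 500 may be sketched as below. Inversion is modeled here as simple negation of each value, which is an assumption that presumes the signatures are centered about zero; the labels, names and values are hypothetical.

```python
def build_signature_matrix(gene_signatures, response_signatures):
    """Stack gene, response and inverted-response signatures as labeled rows.

    Returns (labels, rows); each row holds one value per cell line (column).
    Inverted signatures are formed by negation (an illustrative assumption).
    """
    labels, rows = [], []
    for gene, signature in gene_signatures.items():
        labels.append(("gene", gene))
        rows.append(list(signature))
    for treatment, signature in response_signatures.items():
        labels.append(("response", treatment))
        rows.append(list(signature))
    for treatment, signature in response_signatures.items():
        labels.append(("inverted", treatment))
        rows.append([-v for v in signature])
    return labels, rows


# Hypothetical signatures across three cell lines.
genes = {"geneA": [0.5, -1.0, 1.5], "geneB": [-0.5, 1.0, -1.5]}
responses = {"treatment1": [0.2, -0.4, 0.6]}

labels, rows = build_signature_matrix(genes, responses)
for label, row in zip(labels, rows):
    print(label, row)
```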

An approach to try and find relationships among data on this order of magnitude may include generating a matrix containing the data points used to generate each of the gene expression signatures 302, and another matrix containing the data points used to generate the phenotypic response signatures 402 and inverted phenotypic response signatures 402 i and producing a correlation matrix having a number of values equal to the product of the number of gene expression signatures times the sum of the number of phenotypic response signatures and inverted phenotypic response signatures. Such a correlation matrix is generated by calculating the inner product of the values of the phenotypic response signatures 402 and inverse phenotypic response signatures 402 i with the values of the gene expression signatures 302, and typically centering by means and scaling by standard deviation. The inner product may be calculated as follows:

$$\frac{\vec{D}_{j} \cdot \vec{G}_{k}}{N} = i_{j,k} \qquad (1)$$

where

- $\vec{D}_{j}$ is the vector representing the phenotypic expression or inverted phenotypic expression across samples (cell lines) 402, 402 i, respectively, when treated with the $j$-th treatment in the list, where $j$ ranges from 1 to (m+m), since both signatures and inverted signatures are considered, where m is an integer representing the total number of treatments;
- $\vec{G}_{k}$ is the vector representing the gene expression signature 302 for the $k$-th gene expression signature (gene), where $k$ ranges from 1 to N;
- N is the number of samples, e.g., 20 cell lines; and
- $i_{j,k}$ is the value in a correlation matrix filling the $j$-th column and the $k$-th row.

Thus, the values in the correlation matrix are a cross product of normalized or standardized vectors, and may be optionally centered by subtracting a vector mean value. Once the correlation matrix is produced, then various clustering techniques may be applied to try and identify blocking patterns in the data, or “clusters” which might indicate that certain treatments are affecting certain groups of genes. Alternatively, clustering techniques may be applied to treatment and gene expression matrices prior to performing the cross product operation, in an effort to reduce the sizes to something less than what is started with.
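For illustration, the correlation matrix approach of equation (1) may be sketched as follows, with each vector centered by its mean and scaled by its population standard deviation before the inner product is taken. The function names and the sample vectors are hypothetical, and the sketch assumes no signature is constant across samples.

```python
from statistics import mean, pstdev


def standardize(vector):
    """Center by the mean and scale by the population standard deviation."""
    m, s = mean(vector), pstdev(vector)
    return [(v - m) / s for v in vector]


def correlation_matrix(D, G):
    """Return a matrix whose entry [k][j] is (D_j . G_k) / N.

    D holds the phenotypic and inverted phenotypic signatures (columns j),
    G holds the gene expression signatures (rows k), N is the sample count.
    """
    Dz = [standardize(d) for d in D]
    Gz = [standardize(g) for g in G]
    n = len(G[0])
    return [[sum(a * b for a, b in zip(d, g)) / n for d in Dz] for g in Gz]


# Hypothetical data: two genes, one treatment plus its inverted signature.
G = [[1.0, 3.0, 2.0, 4.0], [4.0, 2.0, 3.0, 1.0]]
response = [2.0, 6.0, 4.0, 8.0]
D = [response, [-v for v in response]]

for row in correlation_matrix(D, G):
    print(row)
```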

The present inventor believes that important information is dropped or lost when such data is processed into a correlation matrix, such as by performing a cross product/inner product procedure to obtain the correlation matrix, as described above. When such an operation is performed, phase information is lost, as is well-known, particularly among electrical engineers. Because the phenotypic responses to the treatments at issue are measured as the output (i.e., phenotypic signature) and the inputs are the gene expression ratio values, there is a protein link between the input and output, as noted above with regard to FIGS. 1A and 1B, i.e., the gene expression instructs the generation of proteins which affect the disease state. Treatments impact disease proteins to produce the phenotypic output response. Thus, the present inventor believes that it is important to consider the phase relationships between the inputs (gene expression) and outputs (phenotypic responses), as important factors based on the protein links between the inputs and outputs.

Hence, the approach taken by the present invention is akin to a lumped parameters modeling approach. As noted above with regard to FIG. 1A, the genotypic signature (gene expression signature) is modeled as the input 102, the phenotypic signature is modeled as the output 106, and a “black box” 104 represents a transfer function that transforms input 102 to output 106. Note that input 102 and output 106 may be single inputs and outputs, e.g., a single gene and single drug impact or single protein production, or multiple inputs and outputs, e.g., groups of genes producing groups of proteins and/or multiple drug impacts. The present invention is adapted to further modifications within the black box as additional information becomes known about the relationships between genes and the production of proteins resultant therefrom. This will enable the input and output to be modeled, for example, like an electrical circuit, with black box 104 containing “resistances”, “inductances”, “capacitances”, “emfs”, etc., that model the best knowledge that is gained with regard to the relationships between the genes and proteins produced therefrom, which will enable the modeling of an accurate phase relationship between the gene expression signatures and the phenotypic response signatures. Such complete modeling knowledge is currently not known however, and it may be at least five years, or more, before such knowledge is obtained in sufficient detail. In the meantime, this technique assumes that an input signature is either in phase (0°, synchronized), or out of phase (i.e., 180° out of phase, anti-synchronized) with the output signature, as noted above.

Rather than taking a cross product and forming a correlation matrix, this methodology assembles all of the signatures, after appropriate transformation such as by use of Log transforms, into one large list of profiles or matrix of profile values 500, properly normalized, as shown in FIG. 5. Continuing with the example discussed above, the rows of matrix 500 include the gene expression signatures 302 and phenotypic response signatures 402 and inverted phenotypic response signatures 402 i.
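
By way of illustration only, the assembly of a matrix such as matrix 500 might be sketched as follows. The function names, the labels and the numeric values are hypothetical, and inversion is assumed here to be a simple negation of the log-transformed values, which is one possible way to represent a 180° phase shift.

```python
import math

def log_ratio_signature(diseased, reference):
    """Log2 ratio of diseased to reference expression, one value per sample."""
    return [math.log2(d / r) for d, r in zip(diseased, reference)]

def invert(signature):
    """Assumed inversion: negate each value to model a 180-degree phase shift."""
    return [-v for v in signature]

def assemble_matrix(gene_signatures, phenotypic_signatures):
    """Stack gene rows, phenotypic rows and inverted phenotypic rows.

    Both arguments map a name to a list of values, one per tissue sample.
    Returns (labels, rows), where each label is a (kind, name) tuple.
    """
    labels, rows = [], []
    for name, sig in gene_signatures.items():
        labels.append(("gene", name))
        rows.append(list(sig))
    for name, sig in phenotypic_signatures.items():
        labels.append(("phenotypic", name))
        rows.append(list(sig))
    for name, sig in phenotypic_signatures.items():
        labels.append(("phenotypic_inverted", name))
        rows.append(invert(sig))
    return labels, rows

# Hypothetical values for one gene feature and one treatment across three samples.
genes = {"gene_A": log_ratio_signature([4.0, 1.0, 0.5], [1.0, 1.0, 1.0])}
treatments = {"treatment_X": [1.5, 0.0, -0.5]}
labels, rows = assemble_matrix(genes, treatments)
```

Each row keeps its label, so that after clustering it remains possible to tell whether a cluster member is a gene expression signature, a phenotypic signature, or an inverted phenotypic signature.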

Rather than clustering within a single zone (i.e., the zone containing the gene expression signatures 302 or the zone containing the phenotypic response signatures 402 and/or the inverted phenotypic response signatures 402 i), or performing a cross product correlation to combine these zones (thereby losing valuable information, as discussed above), clustering procedures are performed over the entire matrix 500 as a whole. Each of the signatures is normalized for comparison purposes. One normalization method is Z-scoring, although other normalization methods may be substituted. Also, weighting techniques may be applied so that highly up-regulated features do not receive over-amplified attention during the clustering process. Further, noisy profiles may be given relatively reduced weight.
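
As a minimal sketch of the normalization step, each row of the matrix may be Z-scored independently. The per-row weight shown here is only one assumed way of down-weighting a noisy profile; the function names and the choice of the population standard deviation are likewise assumptions made for illustration.

```python
import statistics

def z_score(signature):
    """Z-score one signature: subtract its mean, divide by its standard deviation."""
    mean = statistics.mean(signature)
    sd = statistics.pstdev(signature)
    if sd == 0:
        # A flat signature carries no shape information; map it to zeros.
        return [0.0 for _ in signature]
    return [(v - mean) / sd for v in signature]

def normalize_matrix(rows, weights=None):
    """Z-score every row, then scale each row by an optional weight.

    A weight below 1.0 shrinks a row toward the origin (assumed weighting scheme).
    """
    if weights is None:
        weights = [1.0] * len(rows)
    return [[w * z for z in z_score(row)] for row, w in zip(rows, weights)]

# Hypothetical rows: the first is treated as noisy and given half weight.
normalized = normalize_matrix([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]], weights=[0.5, 1.0])
```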

Clustering of the information may be performed using any clustering technique that is capable of clustering similar signatures, such as K-means or other known techniques. However, it is preferred to perform the clustering using the tools and techniques described in co-pending, commonly owned application Ser. No. 09/986,746 filed Nov. 9, 2001 and titled “System and Method for Dynamic Data Clustering” which application is incorporated herein, in its entirety, by reference thereto. Similarity may be defined by the Euclidean distance between the normalized vectors in matrix 500. These dynamic clustering techniques are scalable and parallelizable, so that they can handle large scale problems such as those presented in the context of the present invention. The result of these clustering operations gives clusters which contain mixtures of phenotypic response signatures 402, 402 i and gene expression signatures 302. Sometimes the clustering occurs among genes in phase and sometimes with genes out of phase. Some clusters may contain only phenotypic response signatures 402, 402 i and some clusters may contain only gene expression signatures 302. The clusters of interest to the present techniques contain both types of signatures to imply gene-drug associations.
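
The following sketch uses plain K-means with Euclidean distance, which the text names as an acceptable alternative; it is not the dynamic data clustering technique of application Ser. No. 09/986,746. The labels, the row values and the choice of initial centers are hypothetical. After clustering, only the clusters containing both gene expression signatures and phenotypic signatures (in phase or inverted) are retained.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(rows, initial_centers, iterations=20):
    """Plain K-means; returns (assignments, centers)."""
    centers = [list(c) for c in initial_centers]
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: euclidean(row, centers[k]))
            for row in rows
        ]
        # Move each center to the mean of its members.
        for k in range(len(centers)):
            members = [row for row, a in zip(rows, assignments) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

def mixed_clusters(labels, assignments):
    """Return ids of clusters that contain both gene and phenotypic rows."""
    kinds_by_cluster = {}
    for (kind, _name), cluster in zip(labels, assignments):
        group = "gene" if kind == "gene" else "phenotypic"
        kinds_by_cluster.setdefault(cluster, set()).add(group)
    return sorted(c for c, kinds in kinds_by_cluster.items() if len(kinds) == 2)

# Hypothetical normalized rows: one gene, two treatments and their inversions.
labels = [("gene", "g1"), ("phenotypic", "tX"), ("phenotypic", "tY"),
          ("phenotypic_inverted", "tX"), ("phenotypic_inverted", "tY")]
rows = [[1.2, 0.0, -1.2], [1.0, 0.2, -1.2], [-1.2, 0.1, 1.1],
        [-1.0, -0.2, 1.2], [1.2, -0.1, -1.1]]
assignments, centers = k_means(rows, initial_centers=[rows[0], rows[2]])
clusters_of_interest = mixed_clusters(labels, assignments)
```

In this made-up example, gene g1 lands in the same cluster as treatment tX and the inverted signature of treatment tY, which would suggest that tX is in phase and tY is out of phase with that gene.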

FIG. 6 illustrates plots of three treatment/phenotypic profiles 602, 604, 606 resultant in NCI Lung Cancer Cell Lines having been treated with three different treatments, that were determined, by dynamic data clustering techniques described in application Ser. No. 09/986,746, to be in synchronization with gene expression profile 608. Visual observation of these plots makes it evident that the peaks and valleys of the phenotypic profiles 602, 604, 606 generally follow the same contour evident in the gene expression profile 608.

FIG. 7 is a schematic representation of an ellipsoid 700 which represents the plot of a cluster of vectors from a matrix such as matrix 500 or a similar matrix assembled, when the vectors are plotted in high dimensional space. The clustering techniques described above are designed to find and identify such clusters, as well as locate the densest part of each ellipsoid cluster, shown as diagonal or “ridge” 611 in FIG. 7. Using a dynamic data clustering system as disclosed in application Ser. No. 09/986,746, the system uses force functions to converge a mathematical probe to the densest part of a profile cluster. Thus, the system not only identifies the clusters but defines a point of reference for all profiles in the cluster, with respect to each cluster identified. The distance of each profile from the center of the cluster it belongs to can be defined by any viable distance metric. An example of a distance metric used is Euclidean distance. The relative angular positions of the profiles may also be consequential for selecting combinations of effective treatments.
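
As a rough sketch of the distance measurement, the profiles in a cluster can be ranked by their Euclidean distance from a reference point. The centroid is offered here only as a simple stand-in for the density center; the force-function probe of application Ser. No. 09/986,746 is not reproduced. All names and values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(rows):
    """Mean vector of the rows; an assumed stand-in for the density center."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def distances_from_center(labels, rows, center):
    """Return (label, distance) pairs sorted from nearest to farthest."""
    pairs = [(label, euclidean(row, center)) for label, row in zip(labels, rows)]
    return sorted(pairs, key=lambda pair: pair[1])

# Hypothetical cluster members measured against a given reference point.
ranked = distances_from_center(
    ["treatment_A", "treatment_B", "treatment_C"],
    [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]],
    center=[0.0, 0.0],
)
```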

Treatment 512 is noted as being a member of a group of similar treatments that are the closest to the densest location 610 of the ellipsoid 700. Although the distances measured do not identify angularity between profiles, or which side of the ridge 611 (dominant, principal axis running through the densest location) a particular vector (profile) lies on, the distance value does give an indication of how close to the ridge 611 and densest location 610 that vector lies. For example, the distance for treatment 512 is shown close to ridge 611 in FIG. 7. Another treatment vector 514 was measured to be further from ridge 611 as shown in FIG. 7. A third treatment vector 516 is shown at a somewhat intermediate distance from ridge 611, relative to distances for 512 and 514, and the treatment vector 518 is shown at an intermediate distance between the distances for 512 and 514.

One approach is to find a combination of treatments, such as those represented in FIG. 7, that show relationships to different genes in the clustered profile, so that each treatment appears to be related to strategic combinations of genes involved in the disease process. The treatments are then combined and tested together, each in a low dose/amount to observe whether a combinatorial synergistic effect on this disease is achieved by the combination. Serendipitously, the low dose/amount combination reduces the side effects of each individual treatment in the combination. Of course any combination where an adverse reaction occurs among two or more treatments in the combination used would be discarded as being unsuitable for use as a potential treatment combination. Useful combinations will specifically and effectively target the disease process being studied.

When testing with one or more identified treatments that provide symmetric and/or anti-symmetric phenotypic response profiles relative to one or more gene expression profiles thought to be causally connected to the abnormality being studied (e.g., tumor growth), if the results do not affect all of the genes thought to be responsible for the abnormality, then events 208 and 210 may be repeated with additional different treatments, and analysis of phenotypic responses from these additional treatments may be carried out to select alternative treatments or additional treatments to be added to the treatment regimen from which the most recent tests were conducted. Selection may be made by again selecting those treatments that produce phenotypic profiles that are in synchronization or anti-synchronization with gene expression profiles of interest (gene expression profiles of those genes thought to be affecting or affected by the abnormal condition).

Further, by identifying the treatment vectors as in the example shown above, various combinations of treatments in a treatment family (such as, for example, drugs in the drug family, or related compounds in a compound family) of each identified treatment vector may be tested in the manner described above, as related treatments in a treatment family (e.g., drugs in a drug family) will generally fall within similar relative distances from the densest location of the ellipsoid. In this way, the combinations of treatments may be predicted to try as potential combinations for multiple treatments in patients which will address the broadest spectrum of genes related to the production of proteins seen as elevated or inhibited when a tissue is in the disease state. By testing the predicted combinations, useful combinations for treatment in patients will be much easier to identify. The present invention is a forward-looking way of choosing and predicting specific combinations of treatments to test, e.g., using high-throughput (HTP) screening of treatment combinations, and as such, greatly reduces the time to finding successful combinations, which currently have only been discovered accidentally, through hindsight and experiences gained through individual treatments.
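
One possible way to choose candidate treatments having varying distances from the densest location is sketched below. Picking evenly spaced ranks from a nearest-to-farthest ordering is an assumption made for illustration, not a rule stated in this description; the function name and the sample values are hypothetical.

```python
def pick_spread(ranked, count):
    """Pick `count` entries spread evenly across a nearest-to-farthest ranking.

    `ranked` is a list of (treatment_name, distance) pairs already sorted
    by distance from the density center.
    """
    if count <= 0 or not ranked:
        return []
    if count >= len(ranked):
        return list(ranked)
    if count == 1:
        return [ranked[0]]
    # Evenly spaced positions from the first to the last rank.
    step = (len(ranked) - 1) / (count - 1)
    return [ranked[round(i * step)] for i in range(count)]

# Hypothetical ranking of five treatments in one cluster.
ranking = [("a", 0.1), ("b", 0.5), ("c", 0.9), ("d", 1.4), ("e", 2.0)]
combination = pick_spread(ranking, 3)
```

The combination obtained this way is only a candidate list for testing; any combination showing an adverse reaction among its members would still be discarded as described above.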

The treatments identified are targeted to the genes involved in the disease process/malady. Because of this, the chances of significant side effects are reduced. For those combinations found to be effective in the sample tissues, further testing, such as animal testing, would be warranted to study any effects the treatments may have on normal tissues within an organism, since the testing with the disease tissue samples only proves that the combination of drugs applied is effective at treating the diseased tissues. For the cancer examples discussed above, testing with the tissue samples would only show whether or not the combination of treatments effectively kills the cancer cells. Animal testing would further show the effects of the treatment combination on the normal tissues in the organism, to see if the animal survives the treatment combination.

A technique for excluding potential treatments may also be carried out. One example of such an exclusion technique is to generate at least one phenotypic signature representing treatment-response values of each of the tissue samples exhibiting the malady, resultant from treating the tissue samples exhibiting the malady, with at least one treatment having known undesirable characteristics (e.g., is toxic to normal tissues, or ineffective, or has other undesirable side effects, etc.) for treatment of the tissues exhibiting the malady, using the techniques described above. This one or more phenotypic signature(s) are then included with all other signatures (e.g., the signatures generated in matrix 500) and then subjected to clustering as described above. Any phenotypic signature representing treatment-response values that are close to a phenotypic signature generated from a treatment known to have undesirable effects is then eliminated from candidacy for selection as a potential treatment. For example, a predefined distance may be established as a threshold, so that any phenotypic signature that is less than or equal to the predefined distance from a location of a phenotypic signature resulting from treatment with a treatment having known undesirable characteristics is eliminated.
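
The threshold test described above can be sketched as follows, assuming normalized signatures and Euclidean distance. The function name, the treatment names and the numeric values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exclude_near_undesirable(candidates, undesirable, threshold):
    """Drop candidates lying within `threshold` of any undesirable signature.

    `candidates` and `undesirable` map a treatment name to its normalized
    phenotypic signature. A distance less than or equal to the threshold
    eliminates the candidate.
    """
    kept = {}
    for name, signature in candidates.items():
        too_close = any(
            euclidean(signature, bad) <= threshold for bad in undesirable.values()
        )
        if not too_close:
            kept[name] = signature
    return kept

# Hypothetical signatures: treatment_A sits exactly at the threshold distance.
remaining = exclude_near_undesirable(
    {"treatment_A": [0.0, 0.0], "treatment_B": [3.0, 4.0]},
    {"known_toxic": [0.0, 1.0]},
    threshold=1.0,
)
```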

Alteration of existing treatments for tissue maladies may also be performed. For example, a phenotypic signature representing the treatment-response values of each of the tissue samples exhibiting the malady, as effected by the original or existing single or combination treatment may be generated according to the techniques described above. Those tissue samples may then be treated with one or more treatments not included in the original or existing single or combination treatment and then one or more phenotypic signatures may be generated from the treatment-response values of the tissues resulting from the one or more treatments. Clustering may then be performed, as described above, based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together, including both the phenotypic signature from the original treatment and the phenotypic signature(s) from the additional treatment(s). At least one additional treatment may then be selected for incorporating with the original treatment, by identifying the treatment-response phenotypic signature(s) caused by the additional treatment(s), and which are clustered with phenotypic signatures identifying the treatment-response phenotypic signature(s) caused by the treatment or treatments in the original treatment, as well as with gene expression signatures representing differential expression levels representative of the diseased tissue samples, but separated from the phenotypic signatures identifying the treatment-response phenotypic signatures caused by the treatment or treatments in the original treatment, so as to address malady-related gene activity not currently addressed by the treatment or treatments in the original or existing treatment.
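
A sketch of the selection rule for augmenting an existing treatment is given below. It assumes that cluster assignments are already available and that being "separated" from the original treatment can be expressed as a minimum Euclidean distance; both the function and that threshold are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_augmenting(labels, rows, assignments, original, min_separation):
    """Select treatments clustered with the original treatment and with at
    least one gene signature, yet at least `min_separation` away from it."""
    index = {label: i for i, label in enumerate(labels)}
    orig_i = index[("phenotypic", original)]
    cluster = assignments[orig_i]
    has_gene = any(
        kind == "gene" and a == cluster
        for (kind, _name), a in zip(labels, assignments)
    )
    if not has_gene:
        return []
    selected = []
    for i, (kind, name) in enumerate(labels):
        if kind == "gene" or name == original or assignments[i] != cluster:
            continue
        if euclidean(rows[i], rows[orig_i]) >= min_separation:
            selected.append((kind, name))
    return selected

# Hypothetical cluster: tA nearly duplicates the original, tB is separated.
labels = [("gene", "g1"), ("phenotypic", "orig"), ("phenotypic", "tA"),
          ("phenotypic", "tB"), ("phenotypic", "tC")]
rows = [[0.0, 0.0], [1.0, 0.0], [1.1, 0.0], [-1.0, 0.0], [10.0, 10.0]]
assignments = [0, 0, 0, 0, 1]
additions = select_augmenting(labels, rows, assignments, "orig", min_separation=1.0)
```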

Protein pathways, implicated by differential gene expression levels when comparing treated tissues to non-treated tissues and among diseased and non-diseased tissues, are used to produce phase relations between treatment responses and expression profiles across the tissue samples being tested. An example of such tissue samples is the twenty cancer cell lines referred to above. Using the techniques described above, all phase-related profiles are normalized and clustered. The number and sizes of the resulting clusters may indicate their relative importance as to effective treatments. The structure of each cluster implies treatment-gene associations to guide multi-treatment selections.

FIG. 8 illustrates a typical computer system in accordance with an embodiment of the present invention. The computer system 800 may include any number of processors 802 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 806 (typically a random access memory, or RAM) and primary storage 804 (typically a read only memory, or ROM). As is well known in the art, primary storage 804 acts to transfer data and instructions uni-directionally to the CPU and primary storage 806 is used typically to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable computer-readable media such as those described above. A mass storage device 808 is also coupled bi-directionally to CPU 802 and provides additional data storage capacity and may include any of the computer-readable media described above. Mass storage device 808 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. It will be appreciated that the information retained within the mass storage device 808 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 806 as virtual memory. A specific mass storage device such as a CD-ROM 814 may also pass data uni-directionally to the CPU.

CPU 802 is also coupled to an interface 810 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. CPU 802 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 812. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.

The hardware elements described above may implement the instructions of multiple software modules for performing the operations of this invention. For example, instructions for clustering vectors may be stored on mass storage device 808 or 814 and executed on CPU 802 in conjunction with primary memory 806.

In addition, embodiments of the present invention further relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM, CD-RW, DVD-ROM, or DVD-RW disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, treatment, tissue sample, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto. 

1. A method for discovering a combination of treatments to reduce the progress of, or eliminate a tissue malady, said method comprising the steps of: (a) measuring gene expression values of at least one sample of tissue exhibiting the tissue malady and at least one reference sample tissue that does not exhibit the malady, using at least one CGH array designed to measure gene sequences and possible variations in gene sequences attributable to the malady; (b) generating gene expression signatures from differential expression values of ratios of the measured gene expression values between the at least one sample exhibiting the malady and the at least one reference sample, across all samples, respectively; (c) treating the at least one tissue sample exhibiting the malady with a treatment; (d) measuring a treatment-response value with respect to each of the tissue samples treated, as effected by the treatment; (e) generating a phenotypic signature representing the treatment-response values of each of the tissue samples treated; (f) repeating steps (c)-(e) with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments; (g) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (h) selecting treatments by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the at least one tissue sample exhibiting the malady.
 2. The method of claim 1, further comprising designing the CGH array designed to measure gene sequences and possible variations in gene sequences attributable to the malady; and providing the CGH array for said measuring gene expression values.
 3. The method of claim 1, wherein each said treatment is selected from the group consisting of: a drug, a combination of drugs, a compound, a combination of compounds, radiation, a genetic sequence, a combination of genetic sequences, heat, cryogenics and a combination of two or more of any of the previous members in this group.
 4. The method of claim 1, further comprising the steps of: labeling the phenotypic signatures as “in phase” signatures; generating “out of phase” signatures by inverting the “in phase” signatures; and including the “out of phase” signatures with the “in phase” signatures and the gene expression signatures when performing steps (g) and (h).
 5. The method of claim 1, wherein said clustering operation includes finding a density center of a cluster, and calculating distances of the phenotypic signatures, belonging to the cluster, from the density center.
 6. The method of claim 5, wherein the selection of treatments is made to address a broad spectrum of genes involved in the process of the malady.
 7. The method of claim 6, wherein the treatments are selected by selecting treatment-response signatures within a cluster and having varying distances from the density center.
 8. The method of claim 1, wherein said phenotypic signatures are normalized prior to said clustering.
 9. The method of claim 4, wherein said phenotypic signatures are normalized prior to said clustering.
 10. The method of claim 1, wherein said CGH array is processed on a two-color, two channel microarray apparatus to measure said gene expression values.
 11. The method of claim 1, wherein said gene expression values are measured on a single channel microarray apparatus, wherein one of said CGH arrays per sample is used to process each sample exhibiting the malady and one of said CGH arrays per sample is used to process each reference sample.
 12. The method of claim 1, wherein each treatment-response value comprises a concentration level or amount of the treatment used to block or retard the progress of the malady by a predetermined percentage over a predetermined period of time after treatment.
 13. The method of claim 1, wherein each treatment-response value comprises a value characterizing the amount of blocking or retardation of the malady over a predetermined period of time after treatment with a fixed amount of the treatment.
 14. The method of claim 1, further comprising generating at least one phenotypic signature representing treatment-response values of each of the tissue samples exhibiting the malady, resultant from treating the tissue samples exhibiting the malady, with at least one treatment having known undesirable characteristics for treatment of the tissues exhibiting the malady; including that at least one phenotypic signature resulting from said treatment having known undesirable characteristics with all other signatures included in performing the clustering step (g); and discarding any phenotypic signature representing treatment-response values from candidacy for the selection step (h) when the phenotypic signature is less than or equal to a predefined distance from a location of the at least one phenotypic signature resulting from treatment with a treatment having known undesirable characteristics.
 15. The method of claim 14, wherein said known undesirable characteristics comprise an unacceptable level of toxicity.
 16. The method of claim 14, wherein said known undesirable characteristics comprise an insufficient efficacy.
 17. A method for screening a combination of treatments to select treatments for tissue exhibiting a malady, said method comprising the steps of: (a) providing differential expression levels of tissue samples exhibiting the malady relative to at least one reference tissue sample from respective features of CGH arrays designed to measure gene sequences and possible variations in gene sequences attributable to the malady; (b) for respective differential expression levels from respective features of respective CGH arrays for each tissue sample exhibiting the malady, providing a gene expression signature representing the differential expression level for each tissue sample for gene expression levels from that feature, respectively; (c) providing a treatment-response value, for each tissue sample exhibiting the malady having been treated with a treatment, as effected by the treatment; (d) generating a phenotypic signature representing the treatment-response values of each of the tissue samples having been treated; (e) repeating steps (c)-(d) with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments; (f) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (g) selecting treatments by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the tissue samples exhibiting the malady.
 18. A method of augmenting an original or existing single treatment or treatment combination for a disease with at least one additional treatment that covers gene activity of the disease not addressed by the original or existing treatment, said method comprising the steps of: (a) providing differential expression levels of diseased tissue samples relative to at least one reference tissue for respective features of CGH arrays designed to measure gene sequences and possible variations in gene sequences attributable to the disease; (b) for respective features of respective CGH arrays for each diseased tissue sample, providing a gene expression signature representing the differential expression level for each tissue sample for that feature, respectively; (c) treating the diseased tissue samples with the original or existing single treatment or combination treatment; (d) measuring a treatment-response value with respect to each of the diseased tissue samples as effected by the original or existing single or combination treatment; (e) generating a phenotypic signature representing the treatment-response values of each of the diseased tissue samples as effected by the original or existing single or combination treatment; (f) treating the diseased tissue samples with a treatment that is not included in the original or existing single or combination treatment; (g) measuring a treatment-response value with respect to each of the diseased tissue samples as effected by the treatment that is not included in the original or existing single or combination treatment; (h) generating a phenotypic signature representing the treatment-response values of each of the diseased tissue samples as effected by the treatment that is not included in the original or existing single or combination treatment; (i) repeating steps (f)-(h) with a different treatment that is also not included in the original or existing single or combination treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments not included in the original or existing single or combination treatment; (j) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (k) selecting at least one treatment by identifying the treatment-response phenotypic signatures caused by the at least one treatment, and which are clustered with phenotypic signatures identifying the treatment-response phenotypic signatures caused by the treatment or treatments in the original treatment, as well as with gene expression signatures representing differential expression levels representative of the diseased tissue samples, but separated from the phenotypic signatures identifying the treatment-response phenotypic signatures caused by the treatment or treatments in the original treatment, so as to address disease-gene activity not currently addressed by the treatment or treatments in the original or existing treatment.
 19. The method of claim 18, wherein each said treatment is selected from the group consisting of: a drug, a combination of drugs, a compound, a combination of compounds, radiation, a genetic sequence, a combination of genetic sequences, heat, cryogenics and a combination of two or more of any of the previous members in this group.
 20. The method of claim 18, further comprising the steps of: labeling the phenotypic signatures as “in phase” signatures; generating “out of phase” signatures by inverting the “in phase” signatures; and including the “out of phase” signatures with the “in phase” signatures and the gene expression signatures when performing steps (j) and (k).
 21. A system for discovering a combination of treatments to reduce the progress of, or eliminate a tissue malady, said system comprising: means for generating a gene expression signature representing differential expression levels of each of a plurality of tissue samples exhibiting the malady, relative to at least one reference tissue from gene expression values determined from respective features of CGH arrays designed to measure gene sequences of the tissues and possible variations in gene sequences attributable to the malady; means for measuring a treatment-response value with respect to each of the tissue samples exhibiting the malady, after treating each tissue sample exhibiting the malady with a treatment; means for generating a phenotypic signature representing the treatment-response values of each of the tissue samples having been treated; and means for performing a clustering operation while considering the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together.
 22. The system of claim 21, further comprising at least one of said CGH arrays designed to measure gene sequences of the tissues and possible variations in gene sequences attributable to the malady.
 23. The system of claim 21, further comprising microarray apparatus for processing the tissue samples exhibiting the malady and the at least one reference tissue to obtain the differential expression levels of the diseased tissues relative to the at least one reference tissue.
 24. The system of claim 21, wherein multiple treatments are successively and independently applied to treat the tissues exhibiting the malady, with respective treatment-response values measured for each and a treatment-response phenotypic signature is generated for each treatment applied.
 25. The system of claim 21, further comprising means for generating out-of-phase phenotypic signatures by inverting said phenotypic signatures.
 26. The system of claim 25, wherein said means for clustering includes said out-of-phase phenotypic signatures with said gene expression signatures and said phenotypic signatures of the treatment-response values when performing said clustering operation.
 27. The system of claim 21, further comprising means for determining a center of density of a cluster identified by said means for clustering, and means for determining a distance of a phenotypic signature found to belong to said cluster, from said center of density.
 28. A computer readable medium carrying one or more sequences of instructions for discovering a combination of treatments to reduce the progress of, or eliminate a tissue malady, wherein execution of one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of: (a) measuring gene expression values of at least one sample of tissue exhibiting the tissue malady and at least one reference sample tissue that does not exhibit the malady, using at least one CGH array designed to measure gene sequences and possible variations in gene sequences attributable to the malady; (b) generating gene expression signatures from differential expression values of ratios of the measured gene expression values between the at least one sample exhibiting the malady and the at least one reference sample, across all samples, respectively; (c) treating the at least one tissue sample exhibiting the malady with a treatment; (d) measuring a treatment-response value with respect to each of the tissue samples treated, as effected by the treatment; (e) generating a phenotypic signature representing the treatment-response values of each of the tissue samples treated; (f) repeating steps (c)-(e) with a different treatment at least once so that multiple phenotypic signatures have been generated for multiple treatments; (g) performing a clustering operation based on the gene expression signatures of the differential expression levels and the phenotypic signatures of the treatment-response values together; and (h) selecting treatments by identifying the treatment-response phenotypic signatures caused by those treatments, and which are clustered with gene expression signatures representing differential expression levels representative of the at least one tissue sample exhibiting the malady. 